Visualization (Exploring variation)

Author

Peter Ganong and Maggie Shi

Published

October 9, 2024

Motivation

Introduction to the next two lectures

Most of our visualization lectures are based on the University of Washington textbook, but that textbook doesn’t have enough material on exploratory data analysis. We are therefore supplementing it with the Data Visualization and Exploratory Data Analysis material from the R for Data Science textbook (with the code translated to Altair).

  • diamonds is from “Exploratory Data Analysis”
  • movies is from the UW textbook
  • penguins is from “Data Visualization”

What is exploratory data analysis?

Data visualization has two distinct goals

  1. exploration for you to learn as much as possible
  2. production for you to teach someone else what you think the key lessons are

How do the modes differ?

  • When you are in exploration mode, you will look at lots of patterns and your brain filters out the noise
  • Production mode is like putting a cone on your dog. You are deliberately limiting the reader’s field of vision such that they see the key messages from the plot and avoid too many distractions

The next two lectures are almost entirely about exploration. Then, in lecture 6, we will talk more about making graphics for production.

Caveat: these modes make the most sense for static visualization. Later in the course, when we talk about dashboards, the task is closer to building an interface that helps readers who don’t code explore the data themselves.

Categorical variables

Categorical variables: roadmap

  • introduce diamonds
  • show table
  • show bar graph

introduce dataset diamonds

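import altair as alt
import numpy as np
import pandas as pd
from plotnine import *
from plotnine.data import diamonds, mpg  # plotnine supplies the diamonds and mpg datasets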
diamonds
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64

53940 rows × 10 columns

diamonds

diamonds_cut = diamonds.groupby('cut').size()
diamonds_cut
cut
Fair          1610
Good          4906
Very Good    12082
Premium      13791
Ideal        21551
dtype: int64
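A minimal sketch of the same tabulation done two ways, on a small made-up table (the values below are illustrative, not the real diamonds data). It is meant to show that `groupby(...).size()` and `value_counts()` give the same counts and differ only in how the result is ordered.

```python
import pandas as pd

# Toy data, made up for illustration (not the real diamonds table)
toy = pd.DataFrame({"cut": ["Ideal", "Premium", "Ideal", "Good", "Ideal", "Premium"]})

by_group = toy.groupby("cut").size()   # one row per cut, ordered by label
by_value = toy["cut"].value_counts()   # same counts, ordered by frequency

print(by_group)
print(by_value)
```

Either form works as input to a bar chart once the counts are turned back into a column.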

Categorical variables

diamonds_cut = diamonds_cut.reset_index().rename(columns={0:'N'}) # Prepare to plot

alt.Chart(diamonds_cut).mark_bar().encode(
    alt.X('cut'),
    alt.Y('N')
)

Categorical variables – summary

  • this section is very brief because there’s basically only one good way to plot categorical variables with a small number of categories and this is it.
    • You can use mark_point() instead of mark_bar(), but overall, there’s a clear right answer about how to do this.
  • We include this material mainly to foreshadow the fact that we will do a lot on categorical variables in the next lecture when we get to “Exploring Co-variation”

Continuous variables

Roadmap: Continuous variables

  • histograms using movies
  • histograms and density plots using penguins
  • diamond size (carat)

Remark: These skills are absolutely fundamental, so we will intentionally be a bit repetitive.

movies dataset

movies_url = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/movies.json'
movies = pd.read_json(movies_url)

recap scatter plot from lecture 3

alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q')
)

scatter plot – N movies per bin

alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('count(IMDB_Rating):Q')
)

scatter plot – syntax trick

Replace count(IMDB_Rating) with count() because we aren’t using IMDB rating any more.

alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('count():Q')
)

histogram using mark_bar()

hist_rt = alt.Chart(movies_url).mark_bar().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('count():Q')
)
hist_rt
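To make concrete what a histogram computes, here is a rough sketch with NumPy: choose bin edges, then count the values that fall in each bin. The ratings and the bin edges are made up for illustration; they are not the movies data and not necessarily the edges Altair would pick for `maxbins=20`.

```python
import numpy as np

# Made-up ratings on a 0-100 scale (illustrative only)
ratings = np.array([5, 12, 18, 45, 47, 50, 88, 91, 99, 100])

# Assumed bin edges: five bins of width 20
edges = np.arange(0, 101, 20)

# Each bin is [left, right), except the last one, which also includes its right edge
counts, _ = np.histogram(ratings, bins=edges)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>3} to {right:>3}: {n}")
```

The bar heights in a histogram are these per-bin counts.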

Discussion question: how would you describe the distribution of Rotten Tomatoes ratings?

histogram of IMDB ratings

hist_imdb = alt.Chart(movies_url).mark_bar().encode(
    alt.X('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('count():Q')
)
hist_imdb

Side-by-side

Discussion question: compare the two ratings distributions. Which is more informative?

hist_rt | hist_imdb

introducing the penguins

from palmerpenguins import load_penguins
penguins = load_penguins()
display(penguins)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
... ... ... ... ... ... ... ... ...
339 Chinstrap Dream 55.8 19.8 207.0 4000.0 male 2009
340 Chinstrap Dream 43.5 18.1 202.0 3400.0 female 2009
341 Chinstrap Dream 49.6 18.2 193.0 3775.0 male 2009
342 Chinstrap Dream 50.8 19.0 210.0 4100.0 male 2009
343 Chinstrap Dream 50.2 18.7 198.0 3775.0 female 2009

344 rows × 8 columns

histogram with steps of 200

alt.Chart(penguins).mark_bar().encode(
    alt.X('body_mass_g', bin=alt.BinParams(step=200)),
    alt.Y('count()')
)

histogram step parameter

20 vs 200 vs 200

Discussion question: what message comes from each binwidth choice?
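One way to see the effect of the step size outside a chart is to bin the same values by hand at two widths. The sketch below uses made-up body masses and two assumed step sizes (200 and 1000); neither the data nor the larger step comes from the lecture.

```python
import pandas as pd

# Made-up body masses in grams (illustrative only, not the penguins data)
masses = pd.Series([2850, 3000, 3150, 3400, 3450, 3700,
                    3750, 4200, 4950, 5000, 5400, 6000])

def bin_counts(values, step):
    """Count values per bin, labelling each bin by its left edge."""
    left_edges = (values // step) * step
    return left_edges.value_counts().sort_index()

narrow = bin_counts(masses, step=200)    # many bins, few values in each
wide = bin_counts(masses, step=1000)     # few bins, smoother shape

print(narrow)
print(wide)
```

Narrow bins show fine detail but also more noise; wide bins smooth the noise but can hide real features such as multiple peaks.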

numeric variable: transform_density()

alt.Chart(penguins).transform_density(
    'body_mass_g',
    as_=['body_mass_g', 'density']
).mark_area().encode(
    x='body_mass_g:Q',
    y='density:Q'
)
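A density plot is a smoothed version of a histogram, computed as a kernel density estimate. The following is a rough NumPy sketch of that idea using a Gaussian kernel. The data, the bandwidth of 300, and the helper name `gaussian_kde_sketch` are assumptions for illustration; Altair chooses its own bandwidth unless you pass one.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

# Made-up body masses in grams (illustrative only); drop missing values before estimating
masses = [3200, 3400, 3500, 3700, 4800, 5000, 5200]
grid = np.linspace(1000, 7500, 651)
density = gaussian_kde_sketch(masses, grid, bandwidth=300)

# A density should integrate to roughly 1 over its support
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

The bandwidth plays the same role for a density plot that the bin width plays for a histogram.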

Back to diamonds, focus on carat

alt.data_transformers.disable_max_rows() # Needed because len(df) > 5000

alt.Chart(diamonds).mark_bar().encode(
    alt.X('carat', bin=alt.Bin(maxbins=10)),
    alt.Y('count()')
)

Continuous Variables

diamonds['bins'] = pd.cut(diamonds['carat'], bins=10)
diamonds.groupby('bins').size()
bins
(0.195, 0.681]    25155
(0.681, 1.162]    18626
(1.162, 1.643]     7129
(1.643, 2.124]     2349
(2.124, 2.605]      614
(2.605, 3.086]       53
(3.086, 3.567]        6
(3.567, 4.048]        5
(4.048, 4.529]        2
(4.529, 5.01]         1
dtype: int64
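To see how `pd.cut` chooses its intervals when given a number of bins, here is a small sketch on made-up values. It splits the range from the minimum to the maximum into equal-width intervals and nudges the lowest edge slightly below the minimum so that the smallest value is included.

```python
import numpy as np
import pandas as pd

# Made-up carat-like values (illustrative only)
values = pd.Series([0.2, 0.3, 0.5, 1.0, 1.5, 5.0])

# retbins=True also returns the bin edges that pd.cut computed
binned, edges = pd.cut(values, bins=4, retbins=True)
counts = binned.value_counts().sort_index()

print(edges)
print(counts)
```

Equal-width bins on skewed data put most observations in the first few intervals, which is the pattern in the table above.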

Continuous Variables: Typical Values

diamonds = diamonds.drop('bins', axis=1) # 'Interval' type causes plotting issues 
diamonds_small = diamonds.loc[diamonds['carat'] < 2.1] # Subset to small diamonds

alt.Chart(diamonds_small).mark_bar().encode(
    alt.X('carat', bin=alt.BinParams(step=0.1)),
    alt.Y('count()')
)

Continuous Variables: Typical Values

alt.Chart(diamonds_small).mark_bar().encode(
    alt.X('carat', bin=alt.BinParams(step=0.01)),
    alt.Y('count()')
)

Discussion questions

  1. What lessons does this plot teach?
  2. What questions does it raise?

Aside: “A Sunday on La Grande Jatte” by Seurat

Unusual numeric values (diamonds)

roadmap

  • case study 1: y dimension in diamonds
    • explore some unusual values
    • three options for handling unusual values
  • case study 2 (next section): cars’ gas mileage

Diamonds: examine unusual values

diamonds['y'].describe()
count    53940.000000
mean         5.734526
std          1.142135
min          0.000000
25%          4.720000
50%          5.710000
75%          6.540000
max         58.900000
Name: y, dtype: float64

Diamonds: examine unusual values

diamonds.loc[(diamonds['y'] < 3) | (diamonds['y'] > 20)] 
carat cut color clarity depth table price x y z
11963 1.00 Very Good H VS2 63.3 53.0 5139 0.00 0.0 0.00
15951 1.14 Fair G VS1 57.5 67.0 6381 0.00 0.0 0.00
24067 2.00 Premium H SI2 58.9 57.0 12210 8.09 58.9 8.06
24520 1.56 Ideal G VS2 62.2 54.0 12800 0.00 0.0 0.00
26243 1.20 Premium D VVS1 62.1 59.0 15686 0.00 0.0 0.00
27429 2.25 Premium H SI2 62.8 59.0 18034 0.00 0.0 0.00
49189 0.51 Ideal E VS1 61.8 55.0 2075 5.15 31.8 5.12
49556 0.71 Good F SI2 64.1 60.0 2130 0.00 0.0 0.00
49557 0.71 Good F SI2 64.1 60.0 2130 0.00 0.0 0.00

Diamonds: sanity check by comparing to 10 random diamonds

diamonds.sample(n=10)
carat cut color clarity depth table price x y z
17152 1.23 Very Good H VS2 62.4 58.0 6848 6.80 6.85 4.26
8388 0.35 Very Good F SI1 61.7 58.0 583 4.51 4.54 2.79
35711 0.30 Premium D VS2 60.9 58.0 911 4.35 4.32 2.64
24475 2.05 Ideal F SI2 62.2 57.0 12743 8.19 8.12 5.07
21128 1.03 Ideal F VVS2 62.4 55.6 9290 6.44 6.49 4.02
23969 2.15 Very Good I SI2 63.4 56.0 12100 8.20 8.15 5.18
32462 0.36 Ideal H VVS1 61.5 56.0 794 4.57 4.60 2.82
33301 0.51 Premium E I1 58.0 60.0 826 5.29 5.23 3.05
46328 0.51 Ideal E VS1 61.5 56.0 1758 5.11 5.16 3.16
38155 0.32 Ideal D VVS2 62.2 56.0 1014 4.41 4.37 2.73

What to do with unusual values?

  1. Drop row
  2. Code value to NA
  3. Winsorize value

Diamonds: option 1 for unusual values: drop

# Keep rows where y lies in [3, 20]; both conditions must hold, so combine them with &
diamonds_clean = diamonds.loc[(diamonds['y'] >= 3) & (diamonds['y'] <= 20)]
diamonds_clean
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64

53931 rows × 10 columns

Diamonds: option 2 for unusual values: missing

diamonds['y'] = np.where((diamonds['y'] < 3) | (diamonds['y'] > 20), np.nan, diamonds['y'])
diamonds_clean = diamonds.dropna()
diamonds_clean
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64

53931 rows × 10 columns
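One detail worth knowing about `dropna()`: with no arguments it drops a row if any column is missing, not just `y`. A minimal sketch on a made-up table (the `note` column is hypothetical) shows the difference when the check is restricted with `subset`.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "y": [3.98, np.nan, 5.76],
    "note": ["ok", "ok", None],   # hypothetical column with its own missing value
})

any_missing = toy.dropna()             # drops rows with a missing value in any column
only_y = toy.dropna(subset=["y"])      # drops rows only where y is missing

print(len(any_missing), len(only_y))
```

For a table with no other missing values the two calls give the same result, but the `subset` form states the intent more precisely.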

Diamonds: option 3 for unusual values: winsorize

pctile01 = diamonds['y'].quantile(0.01)
pctile99 = diamonds['y'].quantile(0.99)

print(f"1st Percentile: {pctile01}")
print(f"99th Percentile: {pctile99}")
1st Percentile: 4.04
99th Percentile: 8.34

Diamonds: option 3 for unusual values: winsorize

diamonds['y_winsor'] = np.where(diamonds['y'] < pctile01, pctile01, 
                                np.where(diamonds['y'] > pctile99, pctile99, diamonds['y']))
diamonds
carat cut color clarity depth table price x y z y_winsor
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 4.04
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 4.04
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 4.07
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 4.23
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 4.35
... ... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50 5.76
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61 5.75
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56 5.68
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74 6.12
53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64 5.87

53940 rows × 11 columns

When is this useful? Income data, test scores, stock returns. Winsorizing matters most when you use procedures whose estimates are sensitive to outliers, such as computing a mean or running a regression.
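As a hedged alternative to the nested `np.where`, pandas offers `Series.clip`, which caps values at a lower and upper bound in one call. The sketch below uses made-up values and 10th/90th percentile cutoffs chosen only because the toy series is so short; it also shows how much a single extreme value moves the mean.

```python
import numpy as np
import pandas as pd

# Made-up measurements with one implausibly small and one implausibly large value
y = pd.Series([0.0, 4.0, 5.0, 6.0, 7.0, 58.9])

# Assumed cutoffs for this toy example (the lecture uses the 1st and 99th percentiles)
lower, upper = y.quantile(0.10), y.quantile(0.90)

nested = np.where(y < lower, lower, np.where(y > upper, upper, y))
clipped = y.clip(lower=lower, upper=upper)   # same result, one call

raw_mean = y.mean()
winsor_mean = clipped.mean()
print(raw_mean, winsor_mean)
```

Both approaches leave missing values as missing, so they can be applied after recoding bad cells to NA.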

how do I know which option to choose?

  • make an educated guess by looking at the data as many ways as possible
  • you often can ask your data provider… but they will quickly grow impatient so try to answer as many questions as possible yourself

Diamonds: what do we know about diamonds with outlier y values?

  • depths look plausible, perhaps a bit in the tails
  • carats are high but plausible
  • prices are high, this would be a red flag but for the fact that carats are high too

Diamonds: what should we actually do?

My take (there is often not a “right” answer, or you won’t know the answer without talking to a data provider)

  • Rows where x, y, and z are all zero: set to NA
  • Rows where y > 20: winsorize? (hard to know for sure…)

Summary: handling unusual numeric values

Problem          Action
Erroneous row    drop row
Erroneous cell   set to NA or winsorize

How do I decide which problem I have? Examine unusual values in context of other columns (same row) and other rows (same columns). We will see this again in a future lecture.

How do I decide whether to set to NA or winsorize? Ideally, ask your data provider what’s going on with these values.

Unusual values case study

Introducing the mpg dataset

mpg
manufacturer model displ year cyl trans drv cty hwy fl class
0 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
1 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
2 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
3 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
4 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
... ... ... ... ... ... ... ... ... ... ... ...
229 volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p midsize
230 volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p midsize
231 volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize
232 volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize
233 volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p midsize

234 rows × 11 columns

Q: Why do some cars have better than typical mileage?

potential_outliers = mpg.loc[(mpg["hwy"] > 40) | ((mpg["hwy"] > 20) & (mpg["displ"] > 5))]
potential_outliers
manufacturer model displ year cyl trans drv cty hwy fl class
23 chevrolet corvette 5.7 1999 8 manual(m6) r 16 26 p 2seater
24 chevrolet corvette 5.7 1999 8 auto(l4) r 15 23 p 2seater
25 chevrolet corvette 6.2 2008 8 manual(m6) r 16 26 p 2seater
26 chevrolet corvette 6.2 2008 8 auto(s6) r 15 25 p 2seater
27 chevrolet corvette 7.0 2008 8 manual(m6) r 15 24 p 2seater
158 pontiac grand prix 5.3 2008 8 auto(s4) f 16 25 p midsize
212 volkswagen jetta 1.9 1999 4 manual(m5) f 33 44 d compact
221 volkswagen new beetle 1.9 1999 4 manual(m5) f 35 44 d subcompact
222 volkswagen new beetle 1.9 1999 4 auto(l4) f 29 41 d subcompact

Note: we call mark_point() more than once and layer the resulting charts with +!

Q: Why do some cars have better than typical mileage?

base = alt.Chart(mpg).mark_point().encode(
         alt.X('displ:Q', title = "Engine size (displ)"),
         alt.Y('hwy:Q', title = "Gas mileage")
    ).properties(
        width=600, 
        height=400 )

outliers = alt.Chart(potential_outliers).mark_point(
    color='red',
    size=100,
    shape='circle'
    ).encode(
        x='displ:Q',
        y='hwy:Q'
    ).properties(
        width=600, 
        height=400)
plot = base + outliers
plot

Q: Why do some cars have better than typical mileage?

labels = alt.Chart(potential_outliers).mark_text(
    align='left',
    dx=10,  # Adjust horizontal distance of text from the point
    dy=-5   # Adjust vertical distance of text from the point
).encode(
    alt.X('displ:Q', title = "Engine size (displ)"),
    alt.Y('hwy:Q', title = "Gas mileage"),
    text='model:N'  # Display the model name as the label
).properties(
    width=600, 
    height=400)

plot = base + outliers + labels
plot

Q: How are there big engines and good mileage? Color by class

alt.Chart(mpg).mark_point(size=100).encode(
    x='displ:Q',  # Quantitative variable for displacement
    y='hwy:Q',    # Quantitative variable for highway mpg
    color='class:N',  # Categorical variable for class
    tooltip=['displ', 'hwy', 'class']  # Optional: tooltip to display values on hover
)

gas mileage summary

  • Question: Why do some cars have better than typical mileage? (What’s going on with these outliers?)
    • Tools:
      • identify outliers
      • color = class
    • Answer: 2-seaters & subcompact
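The answer can also be checked with a table instead of a color encoding. The sketch below rebuilds the nine potential outliers from the table shown earlier (values copied from that output) and tabulates `class` separately for the big-engine cars and the very-high-mileage cars.

```python
import pandas as pd

# The nine rows of potential_outliers, copied from the output above
outliers = pd.DataFrame({
    "model": ["corvette"] * 5 + ["grand prix", "jetta", "new beetle", "new beetle"],
    "displ": [5.7, 5.7, 6.2, 6.2, 7.0, 5.3, 1.9, 1.9, 1.9],
    "hwy":   [26, 23, 26, 25, 24, 25, 44, 44, 41],
    "class": ["2seater"] * 5 + ["midsize", "compact", "subcompact", "subcompact"],
})

# Big engines with decent mileage
big_engine = outliers.loc[outliers["displ"] > 5, "class"].value_counts()

# Very high mileage
high_mpg = outliers.loc[outliers["hwy"] > 40, "class"].value_counts()

print(big_engine)
print(high_mpg)
```

A tabulation like this is a quick cross-check on what the colored scatter plot suggests.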